A Perl Package and an Alignment Tool for Phylogenetic Networks



Gabriel Cardona
Department of Mathematics and Computer Science
University of the Balearic Islands
E-07122 Palma de Mallorca
Spain

Francesc Rosselló
Research Institute of Health Science
University of the Balearic Islands
E-07122 Palma de Mallorca
Spain

Gabriel Valiente
Algorithms, Bioinformatics, Complexity and Formal Methods Research Group
Technical University of Catalonia
E-08034 Barcelona
Spain

February 2, 2008 



Abstract 



Phylogenetic networks are a generalization of phylogenetic trees that allow for the
representation of evolutionary events acting at the population level, like recombination
between genes, hybridization between lineages, and lateral gene transfer. While most
phylogenetics tools implement a wide range of algorithms on phylogenetic trees, there
exist only a few applications to work with phylogenetic networks, and there are no
open-source libraries either. In order to improve this situation, we have developed a Perl
package that relies on the BioPerl bundle and implements many algorithms on phylogenetic
networks. We have also developed a Java applet that makes use of the aforementioned Perl
package and allows the user to make simple experiments with phylogenetic networks without
having to develop a program or Perl script by herself. The Perl package has been accepted
as part of the BioPerl bundle. It can be downloaded from the URL
http://dmi.uib.es/~gcardona/BioInfo/Bio-PhyloNetwork.tgz. The web-based application is
available at the URL http://dmi.uib.es/~gcardona/BioInfo/. The Perl package includes full
documentation of all its features.

Background 

We briefly recall some definitions and results from [2] on phylogenetic networks.

A phylogenetic network on a set S of taxa is any rooted directed acyclic graph whose 
leaves (those nodes without outgoing edges) are bijectively labeled by the set S. 

Let $N = (V, E)$ be a phylogenetic network on $S$. A node $u \in V$ is said to be a tree node
if it has, at most, one incoming edge; otherwise it is called a hybrid node. A phylogenetic
network on $S$ is a tree-child phylogenetic network if every node either is a leaf or has at
least one child that is a tree node.
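
As an illustration of these definitions, the following Python sketch classifies the nodes of a network and tests the tree-child condition. It is not part of the Perl package: the edge-list representation and the function names `classify_nodes` and `is_tree_child` are assumptions made for this example only.

```python
from collections import defaultdict

def classify_nodes(edges):
    """Return (tree_nodes, hybrid_nodes, leaves, children) for a rooted DAG.

    `edges` is an iterable of (parent, child) pairs; this representation is
    an assumption of the sketch, not the data structure used by the package.
    """
    indegree = defaultdict(int)
    children = defaultdict(list)
    nodes = set()
    for parent, child in edges:
        nodes.update((parent, child))
        indegree[child] += 1
        children[parent].append(child)
    # A hybrid node has more than one incoming edge; all other nodes are tree nodes.
    hybrid_nodes = {v for v in nodes if indegree[v] > 1}
    tree_nodes = nodes - hybrid_nodes
    leaves = {v for v in nodes if not children[v]}
    return tree_nodes, hybrid_nodes, leaves, children

def is_tree_child(edges):
    """True if every node is a leaf or has at least one child that is a tree node."""
    tree_nodes, hybrid_nodes, leaves, children = classify_nodes(edges)
    return all(
        v in leaves or any(c in tree_nodes for c in children[v])
        for v in tree_nodes | hybrid_nodes
    )

# A small network with one hybrid node "h" whose parents are "x" and "y".
example = [("r", "x"), ("r", "y"), ("x", "1"), ("x", "h"),
           ("y", "h"), ("y", "3"), ("h", "2")]
print(is_tree_child(example))
```

In the example network every internal node has a tree-node child, so the condition holds; a node whose children are all hybrid nodes would make it fail.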

Let $S = \{\ell_1, \ldots, \ell_n\}$ be the set of leaves. We define the $\mu$-vector of a
node $u \in V$ as the vector $\mu(u) = (m_1(u), \ldots, m_n(u))$, where $m_i(u)$ is the number
of different paths from $u$ to the leaf $\ell_i$. The multiset
$\mu(N) = \{\mu(v) \mid v \in V\}$ is called the $\mu$-representation of $N$ and, provided
that $N$ is a tree-child phylogenetic network, it turns out to completely characterize $N$,
up to isomorphisms, among all tree-child phylogenetic networks on $S$.
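
The path counts can be obtained bottom-up, since the vector of an internal node is the sum of the vectors of its children (one term per outgoing edge). The sketch below illustrates this under the assumption that the network is given as a list of (parent, child) pairs; the function name `mu_vectors` is hypothetical and does not refer to the package's API.

```python
from collections import defaultdict
from functools import lru_cache

def mu_vectors(edges, leaves):
    """Map every node to its vector of path counts towards each leaf.

    `edges` is a list of (parent, child) pairs of a rooted DAG and `leaves`
    is the ordered tuple of leaf names; both are assumptions of this sketch.
    """
    children = defaultdict(list)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        nodes.update((parent, child))

    @lru_cache(maxsize=None)
    def mu(u):
        if not children[u]:
            # A leaf reaches itself by exactly one (trivial) path.
            return tuple(int(u == leaf) for leaf in leaves)
        # Paths from u are partitioned by the first edge taken.
        return tuple(sum(column) for column in zip(*(mu(c) for c in children[u])))

    return {u: mu(u) for u in nodes}

example = [("r", "x"), ("r", "y"), ("x", "1"), ("x", "h"),
           ("y", "h"), ("y", "3"), ("h", "2")]
vectors = mu_vectors(example, ("1", "2", "3"))
print(vectors["r"])  # the root reaches leaf "2" through both parents of "h"
```

The multiset of the values of this dictionary is then the representation described above.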

This allows us to define a distance on the set of tree-child phylogenetic networks on $S$:
the $\mu$-distance between two given networks $N_1$ and $N_2$ is the cardinality of the
symmetric difference of their $\mu$-representations,

$$d_\mu(N_1, N_2) = |\mu(N_1) \,\triangle\, \mu(N_2)|.$$

This defines a true distance, and when $N_1$ and $N_2$ are phylogenetic trees, it coincides with
the well-known partition distance [H]. 
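
Because the representations are multisets, the symmetric difference has to take multiplicities into account. A minimal Python sketch, assuming each representation is given as a list of tuples (the function name `mu_distance` is hypothetical):

```python
from collections import Counter

def mu_distance(mu_rep_1, mu_rep_2):
    """Size of the multiset symmetric difference of two lists of vectors."""
    c1, c2 = Counter(mu_rep_1), Counter(mu_rep_2)
    # Counter subtraction keeps only positive multiplicities, so the two
    # one-sided differences together form the symmetric difference.
    return sum(((c1 - c2) + (c2 - c1)).values())

# Representation of a network with one hybrid node, on three leaves.
network = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 1, 0), (1, 1, 0), (0, 1, 1), (1, 2, 1)]
# Representation of the tree ((1,2),3).
tree = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
print(mu_distance(network, tree))
```

Here the vector (0, 1, 0) occurs twice in the network (once for the leaf and once for the hybrid node above it) and once in the tree, so it contributes one unit to the distance.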

This representation also allows us to define an optimal alignment between two tree-child
phylogenetic networks on $S$, say with $n = |S|$. Given two such networks
$N_1 = (V_1, E_1)$ and $N_2 = (V_2, E_2)$ (where, for the sake of simplicity, we assume
$|V_1| \le |V_2|$), an alignment is just an injective mapping $M : V_1 \to V_2$. The weight
of this alignment is

$$w(M) = \sum_{v \in V_1} \bigl( \|\mu(v) - \mu(M(v))\| + \chi(v, M(v)) \bigr),$$

where $\|\cdot\|$ stands for the Manhattan norm of a vector and $\chi(u, v)$ is $0$ if $u$
and $v$ are both tree nodes or both hybrid nodes, and $1/(2n)$ if one of them is a tree node
and the other one is a hybrid node. An optimal alignment is, then, an alignment with minimal
weight.

The Extended Newick Format 

The eNewick (for "extended Newick") string defining a phylogenetic network appeared in 
the packages PhyloNet [7] and NetGen [5] related to phylogenetic networks, with some
differences between them. The former encodes a phylogenetic network with k hybrid nodes 
as a series of k trees in Newick format, while the latter encodes it as a single tree in Newick 
format but with k repeated nodes. 

Whereas the Perl module we introduce here accepts both formats as input, a complete 
standard for eNewick is implemented, based mainly on NetGen and following the sugges- 
tions of D. Huson and M. M. Morin (among others), to make it as complete as possible. The 
adopted standard has the practical advantage of encoding a whole phylogenetic network as 
a single string, and it also includes mandatory tags to distinguish among the various hybrid 
nodes in the network. 

The procedure to obtain the eNewick string representing a phylogenetic network $N$ goes
as follows: Let $\{H_1, \ldots, H_m\}$ be the set of hybrid nodes of $N$, ordered in any fixed
way. For each hybrid node $H = H_i$, say with parents $u_1, u_2, \ldots, u_k$ and children
$v_1, v_2, \ldots, v_\ell$: split $H$ in $k$ different nodes; let the first copy be a child of
$u_1$ and have all $v_1, v_2, \ldots, v_\ell$ as its children; let the other copies be children
of $u_2, \ldots, u_k$ (one for each) and have no children. Label each of the copies of $H$ as

[label]#[type]tag[:branch_length]

where the parameters are: 

• label (optional) string providing a labelling for the node;

• type (optional) string indicating if the node H corresponds to a hybridization (indicated
by H) or a lateral gene transfer (indicated by LGT) event; note that other types can be
considered in the future;

• tag (mandatory) integer i identifying the node H = H_i;

• branch_length (optional) number giving the length of the branch from the copy of
H under consideration to its parent.

Figure 1: A phylogenetic network N (left), and tree (right) associated to N for computing
its eNewick string.

Figure 2: Representation of a lateral gene transfer event (left) as a hybrid node in a
phylogenetic network (right).
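
The label format can be checked with a regular expression. The Python sketch below is one possible reading of the format described above; it assumes that the type consists of letters only and that labels contain neither `#` nor `:`, which the text does not state explicitly. The names `HYBRID_LABEL` and `parse_hybrid_label` are hypothetical.

```python
import re

# Assumed grammar: [label]#[type]tag[:branch_length], with an alphabetic type
# and an integer tag. These restrictions are assumptions of this sketch.
HYBRID_LABEL = re.compile(
    r"^(?P<label>[^#:]*)#(?P<type>[A-Za-z]*)(?P<tag>\d+)(?::(?P<length>[^:]+))?$"
)

def parse_hybrid_label(text):
    """Split a hybrid-node label into its parts, or return None if it is not one."""
    match = HYBRID_LABEL.match(text)
    if match is None:
        return None
    length = match.group("length")
    return {
        "label": match.group("label") or None,
        "type": match.group("type") or None,
        "tag": int(match.group("tag")),
        "branch_length": float(length) if length is not None else None,
    }

print(parse_hybrid_label("h#H1"))
print(parse_hybrid_label("#LGT2:0.5"))
```

Only the tag is mandatory, so a bare `#3` is accepted, while an ordinary node label without `#` is rejected.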

In this way, we get a tree whose set of leaves is the set of leaves of the original network 
together with the set of hybrid nodes (possibly repeated). Then, the Newick string of 
the obtained tree (note that some internal nodes will be labeled and some leaves will be 
repeated) is the eNewick string of the phylogenetic network. The leftmost occurrence of 
each hybrid node in an eNewick string corresponds to the full description of the network 
rooted at that node, and although node labels are optional, all labeled occurrences of a 
hybrid node in an eNewick string must carry the same label. 

Consider, for example, the phylogenetic network depicted together with its decomposition
in Figure 1. The eNewick string for this network would be ((1,(2)#H1),(#H1,3));
or ((1,(2)h#H1)x,(h#H1,3)y)r; if all internal nodes are labeled. The leftmost occurrence
of the hybrid node in the latter string corresponds to the full description of the network
rooted at that node: (2)h#H1.

Obviously, the procedure to recover a network from its eNewick string is as simple as 
recovering the tree and identifying those nodes that are labeled as hybrid nodes with the 
same identifier. 
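
The recovery procedure can be sketched in Python as a small recursive-descent parser that reads the string as a tree and merges all copies of a hybrid node that share a tag. This is a simplified illustration, not the parser of the Perl package: it ignores branch lengths, does not handle quoted labels, and names each hybrid node `#<tag>` and each unlabeled internal node `_internal<k>`, all of which are conventions invented for this example.

```python
import itertools
import re

def parse_enewick(text):
    """Return (root, edges, hybrids) for an eNewick string.

    Edges are (parent, child) pairs; copies of a hybrid node with the same
    tag are identified as a single node named "#<tag>".
    """
    s = re.sub(r"\s+", "", text).rstrip(";")
    pos = 0
    fresh = itertools.count(1)
    edges, hybrids = [], set()

    def node_name(label):
        match = re.fullmatch(r"[^#]*#[A-Za-z]*(\d+)", label)
        if match:
            name = "#" + match.group(1)
            hybrids.add(name)
            return name
        return label if label else "_internal%d" % next(fresh)

    def subtree():
        nonlocal pos
        kids = []
        if pos < len(s) and s[pos] == "(":
            pos += 1  # consume '('
            kids.append(subtree())
            while pos < len(s) and s[pos] == ",":
                pos += 1
                kids.append(subtree())
            if pos >= len(s) or s[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1  # consume ')'
        start = pos
        while pos < len(s) and s[pos] not in ",():":
            pos += 1
        label = s[start:pos]
        if pos < len(s) and s[pos] == ":":
            # Skip the branch length; this sketch does not keep it.
            while pos < len(s) and s[pos] not in ",()":
                pos += 1
        name = node_name(label)
        edges.extend((name, kid) for kid in kids)
        return name

    root = subtree()
    if pos != len(s):
        raise ValueError("unexpected trailing text")
    return root, edges, hybrids

root, edges, hybrids = parse_enewick("((1,(2)h#H1)x,(h#H1,3)y)r;")
print(root, sorted(edges), hybrids)
```

After parsing, the hybrid node appears as the child of both of its parents, while its own children come only from the occurrence that carries them.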

Notice that gene transfer events can be represented in a unique way as hybrid nodes. 
Consider, for example, the lateral gene transfer event depicted in Figure 2, where a gene
is transferred from species 2 to species 3 after the divergence of species 1 from species
2. The eNewick string ((1,(2,(3)h#LGT1)y)x,h#LGT1)r; describes such a phylogenetic
network. A program interpreting the eNewick string can use the information on node types 
in different ways; for instance, to render tree nodes circled, hybridization nodes boxed, and 
lateral gene transfer nodes as arrows between edges. 
The Perl Module



The Perl module Bio::PhyloNetwork implements all the data structures needed to work
with tree-child phylogenetic networks, as well as algorithms for:

• reconstructing a network from its eNewick string (in all its different flavours), 

• reconstructing a network from its $\mu$-representation,

• exploding a network into the set of its induced subtrees, 

• computing the $\mu$-representation of a network and the $\mu$-distance between two
networks,

• computing an optimal alignment between two networks, 

• computing tripartitions [H [3] and the tripartition error between two networks, and 

• testing if a network is time consistent [1], and in such a case, computing a temporal 
representation. 

The underlying data structure is a Graph::Directed object, with some extra data, for
instance the $\mu$-representation of the network. It makes use of the Perl module
Bio::PhyloNetwork::muVector that implements basic arithmetic operations on $\mu$-vectors.
Two extra modules, Bio::PhyloNetwork::Factory and Bio::PhyloNetwork::RandomFactory, are
provided for the sequential and random generation (respectively) of all tree-child
phylogenetic networks on a given set of taxa.

The web interface and the Java applet 

The web interface, available at http://dmi.uib.es/~gcardona/BioInfo/, allows the user 
to input one or two phylogenetic networks, given by their eNewick strings. A Perl script 
processes these strings and uses the Bio::PhyloNetwork package to compute all available
data for them, including a plot of the networks that can be downloaded in PS format; these 
plots are generated through the application GraphViz and its companion Perl package. 

Given two networks on the same set of leaves, their $\mu$-distance is also computed, as well
as an optimal alignment between them. The algorithm to compute such an alignment relies
on the Hungarian algorithm [6]. If their sets of leaves are not the same, their topological
restriction on the set of common leaves is first computed, followed by the $\mu$-distance and
an optimal alignment.

A Java applet displays the networks side by side, and whenever a node is selected, 
the corresponding node in the other network (with respect to the optimal alignment) is 
highlighted, provided it exists. This is also extended to edges. Similarities between the 
networks are thus evident at a glance and, since the weight of each matched node is also 
shown, it is easy to see where the differences are. 